!pip install imblearn
!pip install delayed
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
Requirement already satisfied: imblearn in c:\users\james\anaconda3\lib\site-packages (0.0)
Requirement already satisfied: imbalanced-learn in c:\users\james\anaconda3\lib\site-packages (from imblearn) (0.8.0)
Requirement already satisfied: numpy>=1.13.3 in c:\users\james\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.19.2)
Requirement already satisfied: scipy>=0.19.1 in c:\users\james\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.5.2)
Requirement already satisfied: scikit-learn>=0.24 in c:\users\james\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (0.24.1)
Requirement already satisfied: joblib>=0.11 in c:\users\james\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (0.17.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\james\anaconda3\lib\site-packages (from scikit-learn>=0.24->imbalanced-learn->imblearn) (2.1.0)
Requirement already satisfied: delayed in c:\users\james\anaconda3\lib\site-packages (0.11.0b1)
Requirement already satisfied: redis in c:\users\james\anaconda3\lib\site-packages (from delayed) (3.5.3)
Requirement already satisfied: hiredis in c:\users\james\anaconda3\lib\site-packages (from delayed) (2.0.0)
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn import metrics
from sklearn.metrics import recall_score
from imblearn.over_sampling import SMOTE
#from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier
)
from xgboost import XGBClassifier
# Loading the dataset from the CSV file
data = pd.read_csv("BankChurners.csv")
data.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
data.shape
(10127, 21)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           10127 non-null  object
 6   Marital_Status            10127 non-null  object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
# a check to see if there are any null values
data.isnull().values.any()
False
• The dataset has 21 columns and 10127 observations.
• All columns have 10127 non-null entries, meaning there are no missing values.
• The isnull check confirms this for us.
data.nunique()
CLIENTNUM                   10127
Attrition_Flag                  2
Customer_Age                   45
Gender                          2
Dependent_count                 6
Education_Level                 7
Marital_Status                  4
Income_Category                 6
Card_Category                   4
Months_on_book                 44
Total_Relationship_Count        6
Months_Inactive_12_mon          7
Contacts_Count_12_mon           7
Credit_Limit                 6205
Total_Revolving_Bal          1974
Avg_Open_To_Buy              6813
Total_Amt_Chng_Q4_Q1         1158
Total_Trans_Amt              5033
Total_Trans_Ct                126
Total_Ct_Chng_Q4_Q1           830
Avg_Utilization_Ratio         964
dtype: int64
• Customer_Age has only 45 unique values, i.e. the customers fall in a fairly narrow age range.
• The majority of the numerical variables are continuous.
• Since all the values in the CLIENTNUM column are unique, it carries no predictive information and we can drop it.
# dropping the CLIENTNUM variable
data.drop(["CLIENTNUM"],axis=1,inplace=True)
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.0 | 2.346203 | 1.298908 | 0.0 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.0 | 3.812580 | 1.554408 | 1.0 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.0 | 2.341167 | 1.010622 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.0 | 2.455317 | 1.106225 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.0 | 8631.953698 | 9088.776650 | 1438.3 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.139637 | 9090.685324 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
• A lot of the continuous variables look to have outliers, given the large difference between the third quartile and the maximum data point.
• This includes credit limit, revolving balance, open to buy, and total transaction amount and count.
• We will gain further insight into this in the Exploratory Data Analysis (EDA).
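A quick way to quantify the suspected outliers is the standard 1.5×IQR whisker rule that the later boxplots use implicitly. A minimal, self-contained sketch (the helper name `iqr_outlier_count` is my own, not from the notebook):

```python
import pandas as pd

def iqr_outlier_count(series):
    """Count points outside the 1.5*IQR whiskers (the boxplot convention)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Synthetic example: one obvious extreme value gets flagged
s = pd.Series([10, 12, 11, 13, 12, 11, 100])
print(iqr_outlier_count(s))  # -> 1
```

Applied to columns such as `data["Credit_Limit"]`, this would give a concrete count to back up the quartile-vs-max observation above.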
dupes = data.duplicated()
sum(dupes)
0
• There do not appear to be any duplicate rows in the dataset.
• Several columns are of type object; we can convert them to category.
• Converting object columns to category reduces the memory required to store the dataframe.
for feature in data.columns:  # loop through all columns in the dataframe
    if data[feature].dtype == 'object':  # only apply to columns holding categorical strings
        data[feature] = pd.Categorical(data[feature])  # convert the column to a pandas Categorical
data.head(10)
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
| 5 | Existing Customer | 44 | M | 2 | Graduate | Married | $40K - $60K | Blue | 36 | 3 | 1 | 2 | 4010.0 | 1247 | 2763.0 | 1.376 | 1088 | 24 | 0.846 | 0.311 |
| 6 | Existing Customer | 51 | M | 4 | Unknown | Married | $120K + | Gold | 46 | 6 | 1 | 3 | 34516.0 | 2264 | 32252.0 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 7 | Existing Customer | 32 | M | 0 | High School | Unknown | $60K - $80K | Silver | 27 | 2 | 2 | 2 | 29081.0 | 1396 | 27685.0 | 2.204 | 1538 | 36 | 0.714 | 0.048 |
| 8 | Existing Customer | 37 | M | 3 | Uneducated | Single | $60K - $80K | Blue | 36 | 5 | 2 | 0 | 22352.0 | 2517 | 19835.0 | 3.355 | 1350 | 24 | 1.182 | 0.113 |
| 9 | Existing Customer | 48 | M | 2 | Graduate | Single | $80K - $120K | Blue | 36 | 6 | 3 | 3 | 11656.0 | 1677 | 9979.0 | 1.524 | 1441 | 32 | 0.882 | 0.144 |
data["Attrition_Flag"] = data["Attrition_Flag"].astype("category")
data["Gender"] = data["Gender"].astype("category")
data["Education_Level"] = data["Education_Level"].astype("category")
data["Marital_Status"] = data["Marital_Status"].astype("category")
data["Income_Category"] = data["Income_Category"].astype("category")
data["Card_Category"] = data["Card_Category"].astype("category")
Attrition_Flag is the categorical target variable, so we will encode its two classes as 1 and 0. This will also make it easier to compare variables during the EDA.
data['Attrition_Flag'] = data['Attrition_Flag'].replace('Attrited Customer',1)
data['Attrition_Flag'] = data['Attrition_Flag'].replace('Existing Customer',0)
# confirm we now only have 2 unique values.
print(data.Attrition_Flag.value_counts())
0    8500
1    1627
Name: Attrition_Flag, dtype: int64
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  int64
 4   Education_Level           10127 non-null  category
 5   Marital_Status            10127 non-null  category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  int64
 10  Months_Inactive_12_mon    10127 non-null  int64
 11  Contacts_Count_12_mon     10127 non-null  int64
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: category(6), float64(5), int64(9)
memory usage: 1.1 MB
• We can see that the 'object' data types have now been converted to 'category'
• Attrition_Flag just has 2 classes; 1 (Attrited Customer) and 0 (Existing Customer)
• We can see that the memory usage has decreased from 1.6+MB to 1.1MB.
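The memory saving from the object-to-category conversion can be verified directly with `memory_usage(deep=True)`. A small self-contained sketch on synthetic data (not the BankChurners file):

```python
import pandas as pd

# A low-cardinality string column: the category dtype stores each label
# once plus small integer codes, instead of one Python string per row.
df = pd.DataFrame({"card": ["Blue", "Silver", "Gold", "Blue"] * 2500})
as_object = df["card"].memory_usage(deep=True)
as_category = df["card"].astype("category").memory_usage(deep=True)
print(as_object, as_category)  # the category version is much smaller
```

The saving grows with the number of rows, since the per-row cost drops from a full string object to a small integer code.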
# While doing uni-variate analysis of numerical variables we want to study their central tendency
# and dispersion.
# Let us write a function that will help us create boxplot and histogram for any input numerical
# variable.
# This function takes the numerical column as the input and returns the boxplots
# and histograms for the variable.
# Let us see if this helps us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    """Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15, 10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a marker will indicate the mean value of the column
    if bins:
        sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins, color="orange")
    else:
        sns.distplot(feature, kde=False, ax=ax_hist2, color="tab:cyan")  # for histogram
    ax_hist2.axvline(
        np.mean(feature), color="purple", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        np.median(feature), color="black", linestyle="-"
    )  # add median to the histogram
histogram_boxplot(data["Customer_Age"])
• The distribution of customer age looks relatively normal, with the mean and median closely aligned.
• However, the boxplot shows 2 outliers at the right end, with a maximum of 73, giving a very slight right skew.
• We will not treat these outliers, as they represent the real age distribution of the bank's customers.
histogram_boxplot(data["Months_on_book"])
• The distribution of months on book has an extreme peak at the mean/median value, both of which are around 36.
• This indicates that the majority of customers have been on the books for about 3 years (i.e. 36 months).
• The boxplot shows outliers at both the left and right whiskers, giving the distribution tails on both sides.
histogram_boxplot(data["Credit_Limit"])
• The distribution of credit limit has an extreme right skew, leaving a large gap between the mean and median.
• There are outliers, as displayed by the boxplot. A large portion of these sit in the 34K-35K range, as shown by the histogram.
• Given there are a lot of customers in this range, we will not treat these outliers, as they represent the real market trend of credit limit with respect to the bank.
histogram_boxplot(data["Total_Revolving_Bal"])
• There do not appear to be any outliers in revolving balance, although the dispersion fluctuates quite a bit.
• A large portion of customers do not appear to have a revolving balance, as shown by the histogram.
• Of those that do have a revolving balance, the largest group has a balance of roughly 2500.
histogram_boxplot(data["Avg_Open_To_Buy"])
• Average open to buy is the difference between the credit limit and the revolving balance. Given the outliers among those variables, we would expect outliers here too, which is clear from the boxplot.
• As a result, average open to buy has a strong right skew.
• Since this variable is derived from two other variables whose outliers we are not treating, we will not treat the outliers of average open to buy either.
histogram_boxplot(data["Total_Trans_Amt"])
• The dispersion of total transaction amount also has a large right skew
• Various outliers can be seen outside the right whisker as can be seen from the boxplot, leading to this large skew.
• We will not treat these outliers as they represent the real market trend
histogram_boxplot(data["Total_Trans_Ct"])
• The distribution of the transaction count over the last 12 months looks more normally distributed than some of the other independent variables.
• There are just 2 outliers, as can be seen from the boxplot.
• We will not treat these outliers as they represent the real market trend.
histogram_boxplot(data["Avg_Utilization_Ratio"])
• Average utilization ratio measures how much of the available credit a customer is using, i.e. the revolving balance relative to the credit available.
• Given the strong right skew of average open to buy, we would expect a similar trend here, and this can be seen in both plots.
• There do not appear to be any noticeable outliers.
# Function to create barplots that indicate percentage for each category.
# This will help provide observations of some of the categorical data types.
def perc_on_bar(plot, feature):
    """
    plot: the axes returned by the countplot
    feature: categorical feature
    the function won't work if a column is passed in the hue parameter
    """
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position for the label
        y = p.get_y() + p.get_height()  # height of the bar
        plot.annotate(percentage, (x, y), size=12)  # annotate the percentage
    plt.show()  # show the plot
plt.figure(figsize=(15,5))
ax = sns.countplot(data["Attrition_Flag"],palette='winter')
perc_on_bar(ax,data["Attrition_Flag"])
• The class distribution of the target variable is imbalanced.
• We have roughly 84% observations for existing customers and 16% for attrited customers. This was clear earlier from the value counts, but a visual gives a good sense of just how large that difference is.
plt.figure(figsize=(15,5))
ax = sns.countplot(data["Gender"],palette='winter')
perc_on_bar(ax,data["Gender"])
• Slightly more of the bank's credit card customers are female than male, but not by much.
• There are approximately 53% female customers and 47% male customers.
plt.figure(figsize=(15,5))
ax = sns.countplot(data["Dependent_count"],palette='winter')
perc_on_bar(ax,data["Dependent_count"])
• The largest group of customers has 3 dependents (27%), followed closely by 2 dependents (26.2%).
• Roughly 9% of customers have no dependents, and 4% of customers have 5 dependents.
plt.figure(figsize=(15,5))
ax = sns.countplot(data["Education_Level"],palette='winter')
perc_on_bar(ax,data["Education_Level"])
• The majority of customers (roughly 31%) have a graduate degree.
• This is followed by high school level at roughly 20%.
• Relatively high portions of the customers are either uneducated (14.7%) or of unknown education level (15%).
plt.figure(figsize=(15,5))
ax = sns.countplot(data["Marital_Status"],palette='winter')
perc_on_bar(ax,data["Marital_Status"])
• The majority of customers are either married (46.3%) or single (38.9%).
• This is considerably higher than the other classes: 7.4% of customers are divorced and another 7.4% have an unknown marital status.
plt.figure(figsize=(15,5))
ax = sns.countplot(data["Income_Category"],palette='winter')
perc_on_bar(ax,data["Income_Category"])
• The largest income group, at roughly 35%, earns less than $40K.
• Roughly 7% of customers earn in excess of $120K. This would be expected, as it is considerably greater than the average wage.
• There is little variance between some of the other classes. For 11% of customers, however, the income level is unknown.
plt.figure(figsize=(15,5))
ax = sns.countplot(data["Card_Category"],palette='winter')
perc_on_bar(ax,data["Card_Category"])
• Unsurprisingly, the blue credit card is the dominant card among customers (roughly 93%). This would be expected, as it is the most commonly known, basic and affordable card offered by banks.
• Silver makes up 5.5% of customers' cards, and gold and platinum together make up just 1.3% of the remainder. Again, this would be expected given they are more expensive and generally carry higher interest rates.
plt.figure(figsize=(15,5))
ax = sns.countplot(data["Total_Relationship_Count"],palette='winter')
perc_on_bar(ax,data["Total_Relationship_Count"])
• Roughly 23% of customers have 3 relationships with the bank. Examples may include a debit, savings and credit account.
• There is consistency among customers with 4, 5 and 6 relationships, each accounting for 18-19% of customers and a combined total of roughly 57% of the bank's customers.
plt.figure(figsize=(15,5))
ax = sns.countplot(data["Months_Inactive_12_mon"],palette='winter')
perc_on_bar(ax,data["Months_Inactive_12_mon"])
• 3 months of inactivity accounts for the largest share of customers, at 38%. This is followed closely by 2 months of inactivity (32.4% of customers).
• 0 months of inactivity accounts for almost no customers, implying that more or less all customers have had at least 1 month of inactivity.
plt.figure(figsize=(15,5))
ax = sns.countplot(data["Contacts_Count_12_mon"],palette='winter')
perc_on_bar(ax,data["Contacts_Count_12_mon"])
• 2 and 3 contacts with the bank account for the largest portions of customers, at 31.9% and 33.4% respectively.
sns.pairplot(data, hue="Attrition_Flag",diag_kind = 'kde',height=1.5)
<seaborn.axisgrid.PairGrid at 0x1a92116b340>
• A positive linear relationship is evident between credit limit and average open to buy.
• Positive correlation is also evident between other variables, such as age and months on book. We will discuss some of these relationships further towards the end of the EDA.
# boxplots to visualize various variables and their relationship with Attrition_Flag
cols = data[['Customer_Age','Credit_Limit','Total_Trans_Amt','Months_on_book']].columns.tolist()
plt.figure(figsize=(14,12))
for i, variable in enumerate(cols):
plt.subplot(3,2,i+1)
sns.boxplot(data["Attrition_Flag"],data[variable],palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
Attrition Flag & Age - there is little difference in age between existing and attrited customers. However, the outliers we saw previously are both existing customers.
Attrition Flag & Credit Limit - the third quartile for existing customers is higher than that for attrited customers, showing that existing customers tend to have higher credit limits. This would be expected, as existing customers are more loyal and the bank would be more inclined to extend them a higher limit. There are noticeable outliers in both classes.
Attrition Flag & Total Transaction Amount - all quartiles for existing customers are higher than those of attrited customers. This would be expected, as customers who keep their credit cards will make more transactions than those who renounce them. There are outliers in the boxplots of both class distributions.
Attrition Flag & Months on Book - there is little difference between the medians for existing and attrited customers. We saw earlier that this value was 36, which looks consistent here and implies that a tenure of around 3 years is common among customers. There are outliers in the boxplots of both class distributions.
# boxplots to visualize various variables and their relationship with Attrition_Flag
cols = data[['Total_Trans_Ct','Total_Amt_Chng_Q4_Q1','Total_Revolving_Bal','Total_Ct_Chng_Q4_Q1','Avg_Utilization_Ratio']].columns.tolist()
plt.figure(figsize=(14,12))
for i, variable in enumerate(cols):
plt.subplot(3,2,i+1)
sns.boxplot(data["Attrition_Flag"],data[variable],palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
Attrition Flag & Total Transaction Count - all quartiles for existing customers are much higher than those of attrited customers. This matches the results we saw for total transaction amount, which would be expected: existing customers should make more transactions than attrited customers, and it is good to see the data verify this. There are outliers in the boxplots of both class distributions, including some towards 0, primarily for the attrited customers.
Attrition Flag & Total Revolving Balance - all quartiles for existing customers are much higher than those of attrited customers. In fact, a high portion of attrited customers have a revolving balance of 0, which makes it difficult for the graph to depict a median value (it is more than likely 0, but this is not clear from the graph). This would also be expected, as existing customers are more likely to carry a balance over from one month to the next. There are no clear outliers among the distributions.
Attrition Flag & Total_Amt_Chng_Q4_Q1 - interestingly, the difference between the median values when comparing Q4 to Q1 has reduced (in comparison to the total transaction amounts) and is now more aligned with existing customers. This may be a result of attrited customers recommencing use of their card. There are clear outliers among both classes.
Attrition Flag & Total_Ct_Chng_Q4_Q1 - similarly, the quartiles of existing and attrited customers are more aligned when comparing transaction counts between Q4 and Q1. There are clear outliers among the distributions.
Attrition Flag & Avg_Utilization_Ratio - all quartiles for existing customers are much higher than those of attrited customers. We see similar results to the revolving balance, with the ratio for most attrited customers leaning towards 0. Given this, there are some outliers for attrited customers, as can be seen.
## Function to plot stacked bar chart
def stacked_plot(x):
    sns.set(palette="nipy_spectral")
    tab1 = pd.crosstab(x, data["Attrition_Flag"], margins=True)
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(x, data["Attrition_Flag"], normalize="index")
    tab.plot(kind="bar", stacked=True, figsize=(10, 5))
    plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.03), ncol=2)
    plt.show()
stacked_plot(data["Gender"])
Attrition_Flag     1     0    All
Gender
F                930  4428   5358
M                697  4072   4769
All             1627  8500  10127
------------------------------------------------------------------------------------------------------------------------
• Earlier we saw there are more females than males. This shows that a slightly higher percentage of female customers renounce their accounts compared to males, although the difference is minor.
stacked_plot(data["Marital_Status"])
Attrition_Flag     1     0    All
Marital_Status
Divorced         121   627    748
Married          709  3978   4687
Single           668  3275   3943
Unknown          129   620    749
All             1627  8500  10127
------------------------------------------------------------------------------------------------------------------------
• There is no significant difference with respect to marital status.
• However, singles and unknowns look to have slightly more attrited customers than divorced and married customers.
stacked_plot(data["Education_Level"])
Attrition_Flag      1     0    All
Education_Level
College           154   859   1013
Doctorate          95   356    451
Graduate          487  2641   3128
High School       306  1707   2013
Post-Graduate      92   424    516
Uneducated        237  1250   1487
Unknown           256  1263   1519
All              1627  8500  10127
------------------------------------------------------------------------------------------------------------------------
• There is no significant difference with respect to the education level of customers, other than for doctorates.
• Ironically, doctorates appear to be the most likely to renounce their credit card.
stacked_plot(data["Income_Category"])
Attrition_Flag      1     0    All
Income_Category
$120K +           126   601    727
$40K - $60K       271  1519   1790
$60K - $80K       189  1213   1402
$80K - $120K      242  1293   1535
Less than $40K    612  2949   3561
Unknown           187   925   1112
All              1627  8500  10127
------------------------------------------------------------------------------------------------------------------------
• There is no significant difference across the different levels of income.
• However, customers that earn less than $40K look most likely to renounce their credit cards.
stacked_plot(data["Card_Category"])
Attrition_Flag     1     0    All
Card_Category
Blue            1519  7917   9436
Gold              21    95    116
Platinum           5    15     20
Silver            82   473    555
All             1627  8500  10127
------------------------------------------------------------------------------------------------------------------------
• Platinum card holders look most likely to renounce their credit cards, followed by gold card holders.
stacked_plot(data["Total_Relationship_Count"])
Attrition_Flag               1     0    All
Total_Relationship_Count
1                          233   677    910
2                          346   897   1243
3                          400  1905   2305
4                          225  1687   1912
5                          227  1664   1891
6                          196  1670   1866
All                       1627  8500  10127
------------------------------------------------------------------------------------------------------------------------
• Customers with 2 bank relationships look most likely to renounce their credit card, followed by customers with 1 relationship. A relationship could be taken as the number of accounts or loans with the bank.
• Customers with more relationships appear less likely to renounce, implying they are more loyal customers for the bank.
stacked_plot(data["Contacts_Count_12_mon"])
Attrition_Flag            1     0    All
Contacts_Count_12_mon
0                         7   392    399
1                       108  1391   1499
2                       403  2824   3227
3                       681  2699   3380
4                       315  1077   1392
5                        59   117    176
6                        54     0     54
All                    1627  8500  10127
------------------------------------------------------------------------------------------------------------------------
• There looks to be a positive correlation between the number of contacts with the bank and the likelihood of a customer renouncing their credit card. Customers contacted 6 times are all attrited customers.
• This would imply that the longer a customer has not used their credit card, the more likely the bank is to contact them.
stacked_plot(data["Months_Inactive_12_mon"])
Attrition_Flag             1     0    All
Months_Inactive_12_mon
0                         15    14     29
1                        100  2133   2233
2                        505  2777   3282
3                        826  3020   3846
4                        130   305    435
5                         32   146    178
6                         19   105    124
All                     1627  8500  10127
------------------------------------------------------------------------------------------------------------------------
• There looks to be a positive correlation between months of inactivity and the likelihood of a customer renouncing their credit card. This is consistent from 1-4 months of inactivity, after which we notice a decreasing trend.
• This implies that some customers begin to reuse their credit card after 4 months of inactivity.
sns.set(rc={"figure.figsize": (10, 10)})
sns.heatmap(
data.corr(),
annot=True,
linewidths=0.5,
center=0,
cbar=False,
cmap="YlGnBu",
fmt="0.2f",
)
plt.show()
From the heatmap we can see that there is strong correlation between some of the independent variables. This was expected, given that some of these variables are derived from one another from the outset. For example:
Average Utilization Ratio - the ratio of the revolving balance to the overall balance available, i.e. the Average Open to Buy. Given this, there is a strong positive correlation between this variable and the Total Revolving Balance (as a customer's revolving balance increases, the utilization ratio increases). Conversely, there is a negative correlation between this variable and the Average Open to Buy.
Total Transaction Amount - displays high correlation with the Total Transaction Count, which we also saw in the earlier boxplots. This is expected: the more transactions a customer makes, the more they spend, i.e. the transaction amount increases. Similarly, the Q4-Q1 comparisons show a positive correlation, though not as strong.
Months on Book - strong positive correlation with age. Again this is expected: the older a customer, the longer they are likely to have been with the bank.
Credit Limit - shows a near-perfect positive correlation with the Average Open to Buy. That is because the Average Open to Buy is the Credit Limit minus the Total Revolving Balance, and the two metrics are identical whenever the revolving balance is zero. Given this, it also has a negative correlation with the utilization ratio, as we saw for the Average Open to Buy.
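The highly correlated pairs described above can also be pulled out programmatically. A minimal sketch on toy data built around the Credit_Limit = Total_Revolving_Bal + Avg_Open_To_Buy identity; the 0.9 cutoff is illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy frame mimicking the linked balance fields: the credit limit is the sum
# of the revolving balance and the open-to-buy amount, as in the dataset.
revolving = rng.uniform(0, 2500, 200)
open_to_buy = rng.uniform(500, 30000, 200)
df = pd.DataFrame({
    "Total_Revolving_Bal": revolving,
    "Avg_Open_To_Buy": open_to_buy,
    "Credit_Limit": revolving + open_to_buy,
})

def high_corr_pairs(frame, cutoff=0.9):
    """Return (col_a, col_b, r) for every column pair with |r| above the cutoff."""
    corr = frame.corr()
    pairs = []
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > cutoff:
                pairs.append((a, b, round(r, 3)))
    return pairs

print(high_corr_pairs(df))  # only the Avg_Open_To_Buy / Credit_Limit pair is flagged
```

The revolving balance varies over a much smaller range than the open-to-buy amount, which is why the credit limit tracks the latter almost perfectly.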
• From the EDA there are some variables with a relatively high percentage of values that are 'Unknown' classes.
• Given this relatively high percentage, I will treat these as null values: I will convert these values to NaN and then use a KNN imputer to impute them.
• Also, I am choosing not to treat outliers for this project, as the outliers in this dataset reflect real market behaviour for the bank. Ideally, though, we would have model comparisons: with more time, we could build models with the outliers left in and compare them to models where the outliers are treated. This is a biased decision, but I am happy to proceed for the purposes of this project.
data = data.replace('Unknown', np.nan)
data.isnull().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 1112 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
• Educational level, marital status and income category all have null values now that we have transformed the unknowns.
from sklearn.impute import KNNImputer  # not imported in the setup cell above

imputer = KNNImputer(n_neighbors=5)
reqd_col_for_impute = ['Marital_Status','Education_Level','Income_Category']
data[reqd_col_for_impute].head()
|   | Marital_Status | Education_Level | Income_Category |
|---|---|---|---|
| 0 | Married | High School | $60K - $80K |
| 1 | Single | Graduate | Less than $40K |
| 2 | Married | Graduate | $80K - $120K |
| 3 | NaN | High School | Less than $40K |
| 4 | Married | Uneducated | $60K - $80K |
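KNNImputer fills a missing entry with the (optionally distance-weighted) mean of that column over the k nearest rows, with nearness measured on the other features. A minimal numpy sketch of the idea for a single missing cell (illustrative, not scikit-learn's actual implementation):

```python
import numpy as np

def knn_impute_one(X, row, col, k=2):
    """Fill X[row, col] with the mean of that column over the k rows
    closest to `row`, measured on the remaining columns."""
    others = [c for c in range(X.shape[1]) if c != col]
    # Only rows where the target column is observed can donate a value.
    candidates = np.where(~np.isnan(X[:, col]))[0]
    dists = np.linalg.norm(X[candidates][:, others] - X[row, others], axis=1)
    nearest = candidates[np.argsort(dists)[:k]]
    return X[nearest, col].mean()

X = np.array([
    [1.0, 2.0, 1.0],
    [1.1, 2.1, 1.0],
    [9.0, 9.0, 3.0],
    [1.05, 2.05, np.nan],  # missing value to impute
])
filled = knn_impute_one(X, row=3, col=2, k=2)
print(filled)  # the two nearest rows both hold 1.0 in that column -> 1.0
```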
education = {'Uneducated':1,'High School':2, 'College':3, 'Graduate':4,'Post-Graduate':5,'Doctorate':6}
data['Education_Level']= data['Education_Level'].map(education)
marital_status = {'Married':1,'Single':2, 'Divorced':3}
data['Marital_Status']=data['Marital_Status'].map(marital_status)
income = {'Less than $40K':1, '$40K - $60K':2, '$60K - $80K':3,'$80K - $120K':4,'$120K +':5}
data['Income_Category']=data['Income_Category'].map(income)
card_cat = {'Blue':1,'Gold':2, 'Silver':3, 'Platinum':4}
data['Card_Category']=data['Card_Category'].map(card_cat)
data.head()
|   | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 45 | M | 3 | 2 | 1 | 3 | 1 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 0 | 49 | F | 5 | 4 | 2 | 1 | 1 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 0 | 51 | M | 3 | 4 | 1 | 4 | 1 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 0 | 40 | F | 4 | 2 | NaN | 1 | 1 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 0 | 40 | M | 3 | 1 | 1 | 3 | 1 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
Values have been encoded.
Given the high correlation between some of the independent variables, we will look to drop some of them. High correlation among independent variables (multicollinearity) can inflate the variance of the model's coefficient estimates and make them unstable.
# Separating target variable and other variables
X = data.drop(columns="Attrition_Flag")
y = data["Attrition_Flag"]
# Utilization Ratio has high correlation with several other variables; we will drop it
# Both the transaction count and the change in transaction count are highly correlated with the total
# transaction amount, so we will drop the count variables
# Also, the EDA showed little to no difference between males and females in renouncing their card versus
# remaining an existing customer, so I will drop the gender variable
X.drop(
columns=[
"Avg_Utilization_Ratio",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Gender",
],
inplace=True,
)
# split the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1,stratify=y)
#Fit and transform the train data
X_train[reqd_col_for_impute]=imputer.fit_transform(X_train[reqd_col_for_impute])
#Transform the test data
X_test[reqd_col_for_impute]=imputer.transform(X_test[reqd_col_for_impute])
#Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print('-'*30)
print(X_test.isna().sum())
Customer_Age 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 dtype: int64 ------------------------------ Customer_Age 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 dtype: int64
• All missing values have been treated.
• Let's inverse map the encoded values.
## Function to inverse the encoding
def inverse_mapping(x, y):
    '''
    x : encoding dictionary, y : column name.
    Note: mutates the global X_train and X_test; np.round guards against
    non-integer values produced by the KNN imputer.
    '''
    inv_dict = {v: k for k, v in x.items()}
    X_train[y] = np.round(X_train[y]).map(inv_dict).astype('category')
    X_test[y] = np.round(X_test[y]).map(inv_dict).astype('category')
inverse_mapping(marital_status,'Marital_Status')
inverse_mapping(education,'Education_Level')
inverse_mapping(income,'Income_Category')
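The helper above inverts each encoding dictionary and rounds before mapping, since imputed values may be non-integer. A self-contained round-trip on toy values (standalone dicts rather than the notebook's globals):

```python
# Forward encoding and its inverse, mirroring the notebook's mapping dicts.
marital_status = {"Married": 1, "Single": 2, "Divorced": 3}
inv = {v: k for k, v in marital_status.items()}

encoded = [marital_status[s] for s in ["Married", "Single", "Divorced"]]
decoded = [inv[round(e)] for e in encoded]  # round() guards imputed floats like 1.8
print(decoded)  # ['Married', 'Single', 'Divorced']
```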
• Checking inverse mapped values/categories.
cols = X_train.select_dtypes(include=['object', 'category'])
for i in cols.columns:
    print(X_train[i].value_counts())
    print('*' * 30)
Graduate 2177 High School 1425 Uneducated 1031 College 709 Post-Graduate 364 Doctorate 312 Name: Education_Level, dtype: int64 ****************************** Married 3301 Single 2771 Divorced 502 Name: Marital_Status, dtype: int64 ****************************** Less than $40K 2489 $40K - $60K 1254 $80K - $120K 1084 $60K - $80K 974 $120K + 503 Name: Income_Category, dtype: int64 ****************************** 1 6621 3 375 2 78 4 14 Name: Card_Category, dtype: int64 ******************************
cols = X_test.select_dtypes(include=['object', 'category'])
for i in cols.columns:
    print(X_test[i].value_counts())  # fixed: the original cell printed X_train counts here, so the output below shows training-set counts
    print('*' * 30)
Graduate 2177 High School 1425 Uneducated 1031 College 709 Post-Graduate 364 Doctorate 312 Name: Education_Level, dtype: int64 ****************************** Married 3301 Single 2771 Divorced 502 Name: Marital_Status, dtype: int64 ****************************** Less than $40K 2489 $40K - $60K 1254 $80K - $120K 1084 $60K - $80K 974 $120K + 503 Name: Income_Category, dtype: int64 ****************************** 1 6621 3 375 2 78 4 14 Name: Card_Category, dtype: int64 ******************************
X_train=pd.get_dummies(X_train,drop_first=True)
X_test=pd.get_dummies(X_test,drop_first=True)
print(X_train.shape, X_test.shape)
(7088, 25) (3039, 25)
Before building the models, let's create functions to calculate the different metrics (accuracy, recall and precision) and to plot the confusion matrix. Functions make sense here, as we will be creating several different models throughout this project.
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, train, test, train_y, test_y, flag=True):
    '''
    model : classifier used to predict values of X
    '''
    # Defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(train)
    pred_test = model.predict(test)
    train_acc = model.score(train, train_y)
    test_acc = model.score(test, test_y)
    train_recall = metrics.recall_score(train_y, pred_train)
    test_recall = metrics.recall_score(test_y, pred_test)
    train_precision = metrics.precision_score(train_y, pred_train)
    test_precision = metrics.precision_score(test_y, pred_test)
    score_list.extend((train_acc, test_acc, train_recall, test_recall, train_precision, test_precision))
    # The following print statements are only displayed when flag is True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return score_list  # returning the list with train and test scores
def make_confusion_matrix(model, y_actual, labels=[0, 1]):
    '''
    model : classifier used to predict values of X
    y_actual : ground truth
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot = np.asarray(annot).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
False Negatives: Reality: a customer renounced their credit card. Model prediction: the customer did NOT renounce their credit card account. Outcome: a loss of income for the bank as a result of the customer renouncing their account.
For this project, our goal is to predict which customers will renounce their credit card account and to provide insights/recommendations for the bank. In this case, failing to identify a customer who will renounce their credit card is the biggest loss to the bank. Minimizing this loss saves the bank money, as it can anticipate renouncers ahead of time and improve in the specific areas that may prevent attrition. Hence, recall is the right metric for assessing model performance. The bank will want recall to be maximized, i.e. we need to reduce the number of false negatives. Recall is the ratio of true positives to actual positives, so high recall implies few false negatives.
Given the large imbalance in the data, i.e. the percentage of existing customers is significantly greater than that of attrited customers, models are going to be biased towards the dominant class, i.e. existing customers. This means it will be difficult from the outset for models to predict renouncers (our overall goal). We will need to use resampling methods as well as hyperparameter tuning to try to improve model performance.
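Concretely, recall and precision can be read straight off the confusion-matrix counts. The sketch below uses the counts from the first logistic-regression confusion matrix in this notebook (TP = 154 attrited customers caught, FN = 334 missed, FP = 51 false alarms):

```python
# Recall and precision worked out from raw confusion-matrix counts.
tp, fn, fp = 154, 334, 51  # positive class = attrited customer

recall = tp / (tp + fn)     # share of actual attriters the model catches
precision = tp / (tp + fp)  # share of predicted attriters that really attrited

print(round(recall, 3), round(precision, 3))  # 0.316 0.751
```

Every false negative converted into a true positive raises recall directly, which is why minimizing false negatives and maximizing recall are the same objective.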
from sklearn.linear_model import LogisticRegression  # not imported in the setup cell above

lr = LogisticRegression(random_state=1)
lr.fit(X_train,y_train)
y_predict = lr.predict(X_test)
lr_score = lr.score(X_test, y_test)
print(lr_score)
0.8733135899967095
test_pred = lr.predict(X_test)
print(metrics.classification_report(y_test, test_pred))
print(metrics.confusion_matrix(y_test, test_pred))
precision recall f1-score support
0 0.88 0.98 0.93 2551
1 0.75 0.32 0.44 488
accuracy 0.87 3039
macro avg 0.82 0.65 0.69 3039
weighted avg 0.86 0.87 0.85 3039
[[2500 51]
[ 334 154]]
Let's evaluate the model performance by using KFold and cross_val_score
The K-Folds cross-validator provides dataset indices to split the data into train/validation sets. It splits the dataset into k consecutive stratified folds (without shuffling by default). Each fold is then used once for validation while the remaining k-1 folds form the training set.
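The stratification can be pictured as dealing each class's rows round-robin into the k folds, so every fold keeps roughly the original class ratio. A pure-Python sketch of the idea (not scikit-learn's exact algorithm, which also shuffles when asked):

```python
from collections import defaultdict

def stratified_folds(y, k=5):
    """Assign sample indices to k folds, class by class, so each fold
    preserves the overall class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Imbalance similar to the training set: ~84% class 0, ~16% class 1.
y = [0] * 84 + [1] * 16
folds = stratified_folds(y, k=4)
for f in folds:
    print(len(f), sum(y[i] for i in f))  # 25 samples per fold, 4 of class 1 in each
```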
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_bfr=cross_val_score(estimator=lr, X=X_train, y=y_train, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_bfr)
plt.show()
• Performance on the training set varies between 0.20 and 0.345. This is extremely low.
• Let's check the performance on the test set.
#Calculating different metrics
scores_LR = get_metrics_score(lr,X_train,X_test,y_train,y_test)
# creating confusion matrix
make_confusion_matrix(lr,y_test)
Accuracy on training set : 0.8789503386004515 Accuracy on test set : 0.8733135899967095 Recall on training set : 0.35118525021949076 Recall on test set : 0.3155737704918033 Precision on training set : 0.7707129094412332 Precision on test set : 0.751219512195122
• Logistic Regression has given a relatively poor performance on the training and test outputs.
• Recall is very low, and thus the model is relatively poor at predicting those who would renounce versus those who would not.
• We can use upsampling and downsampling methods to see if we can improve the model.
print("Before UpSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before UpSampling, counts of label '0': {} \n".format(sum(y_train==0)))
sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1) #Synthetic Minority Over Sampling Technique
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
print("After UpSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After UpSampling, counts of label '0': {} \n".format(sum(y_train_res==0)))
print('After UpSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After UpSampling, the shape of train_y: {} \n'.format(y_train_res.shape))
Before UpSampling, counts of label '1': 1139 Before UpSampling, counts of label '0': 5949 After UpSampling, counts of label '1': 5949 After UpSampling, counts of label '0': 5949 After UpSampling, the shape of train_X: (11898, 25) After UpSampling, the shape of train_y: (11898,)
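Each synthetic SMOTE sample is an interpolation between a minority point and one of its k nearest minority neighbours. A stripped-down numpy sketch of that single interpolation step (illustrative only, not imblearn's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def smote_point(x, neighbor):
    """One synthetic sample on the segment between a minority point and
    a minority neighbour: x + u * (neighbor - x) for random u in [0, 1)."""
    u = rng.uniform(0.0, 1.0)
    return x + u * (neighbor - x)

x = np.array([1.0, 2.0])
nn = np.array([3.0, 4.0])
synthetic = smote_point(x, nn)
print(synthetic)  # lies on the segment between the two minority points
```

Because new points are interpolations rather than copies, SMOTE avoids the exact-duplicate overfitting that plain random oversampling can cause.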
log_reg_over = LogisticRegression(random_state = 1)
# Training the basic logistic regression model with training set
log_reg_over.fit(X_train_res,y_train_res)
LogisticRegression(random_state=1)
# fit model on upsampled data
lr.fit(X_train_res, y_train_res)
y_predict = lr.predict(X_test)
lr_score = lr.score(X_test, y_test)
print(lr_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
0.7462981243830207
[[1962 589]
[ 182 306]]
precision recall f1-score support
0 0.92 0.77 0.84 2551
1 0.34 0.63 0.44 488
accuracy 0.75 3039
macro avg 0.63 0.70 0.64 3039
weighted avg 0.82 0.75 0.77 3039
• Let's again evaluate the model performance, this time on the upsampled data, using KFold and cross_val_score.
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_res=cross_val_score(estimator=log_reg_over, X=X_train_res, y=y_train_res, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_res)
plt.show()
• Performance of the model on the training set varies between 0.7385 and 0.7885, which is an improvement over the initial model (without upsampling).
• Let's check the performance on the test set.
#Calculating different metrics
scores_LR = get_metrics_score(lr,X_train_res,X_test,y_train_res,y_test)
# creating confusion matrix
make_confusion_matrix(lr,y_test)
Accuracy on training set : 0.7688687174315011 Accuracy on test set : 0.7462981243830207 Recall on training set : 0.7565977475205917 Recall on test set : 0.6270491803278688 Precision on training set : 0.7756332931242461 Precision on test set : 0.3418994413407821
# Choose the type of classifier.
from sklearn.model_selection import GridSearchCV  # not imported in the setup cell above

log_reg_over = LogisticRegression(random_state=1, solver='saga')
# Grid of parameters to choose from
parameters = {'C': np.arange(0.1,1.1,0.1)}
# Run the grid search
grid_obj = GridSearchCV(log_reg_over, parameters, scoring='recall')
grid_obj = grid_obj.fit(X_train_res, y_train_res)
# Set the clf to the best combination of parameters
log_reg_over = grid_obj.best_estimator_
# Fit the best algorithm to the data.
log_reg_over.fit(X_train_res, y_train_res)
LogisticRegression(C=0.1, random_state=1, solver='saga')
#Calculating different metrics
get_metrics_score(log_reg_over,X_train_res,X_test,y_train_res,y_test)
# creating confusion matrix
make_confusion_matrix(log_reg_over,y_test)
Accuracy on training set : 0.7069255337031434 Accuracy on test set : 0.7772293517604475 Recall on training set : 0.582114641116154 Recall on test set : 0.5307377049180327 Precision on training set : 0.7757616487455197 Precision on test set : 0.36633663366336633
• Upsampling (which adds synthetic data to the minority class) has greatly improved the performance of the logistic regression. The recall has increased considerably.
• The model is now better at predicting renouncers versus non renouncers. The confusion matrix percentages provide clear evidence of this, given the reduced percentage of false negatives (bottom left section).
• However, it is worth noting that there is now some evidence of overfitting between the recall training and test results.
• By applying regularization we are able to reduce this overfitting, but the recall performance drops.
• We will now compare upsampling (adding synthetic data to the minority class) with downsampling of the majority class. This is another approach that can be taken to reduce the imbalance within a dataset.
data_existing_indices = data[data['Attrition_Flag'] == 0].index  # row indices of existing customers
no_existing = len(data[data['Attrition_Flag'] == 0])  # how many existing-customer cases
print(no_existing)
data_Attrited_indices = data[data['Attrition_Flag'] == 1].index  # row indices of attrited customers
no_Attrited = len(data[data['Attrition_Flag'] == 1])  # how many attrited cases
print(no_Attrited)
8500 1627
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state = 1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
• There is a significant imbalance towards the existing-customer class, as has been discussed throughout this project. We will downsample this class using random under-sampling.
print("Before Down Sampling, counts of label 'Yes': {}".format(sum(y_train==1)))
print("Before Down Sampling, counts of label 'No': {} \n".format(sum(y_train==0)))
print("After Down Sampling, counts of label 'Yes': {}".format(sum(y_train_un==1)))
print("After Down Sampling, counts of label 'No': {} \n".format(sum(y_train_un==0)))
print('After Down Sampling, the shape of train_X: {}'.format(X_train_un.shape))
print('After Down Sampling, the shape of train_y: {} \n'.format(y_train_un.shape))
Before Down Sampling, counts of label 'Yes': 1139 Before Down Sampling, counts of label 'No': 5949 After Down Sampling, counts of label 'Yes': 1139 After Down Sampling, counts of label 'No': 1139 After Down Sampling, the shape of train_X: (2278, 25) After Down Sampling, the shape of train_y: (2278,)
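Random under-sampling simply draws, without replacement, as many majority rows as there are minority rows. The same effect as RandomUnderSampler can be sketched with plain pandas (toy data; the column names are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "feature": rng.normal(size=100),
    "target": [0] * 84 + [1] * 16,  # imbalanced, like Attrition_Flag
})

# Draw minority-class-many rows from each class, without replacement.
minority_n = int(df["target"].value_counts().min())
balanced = pd.concat(
    g.sample(n=minority_n, random_state=1) for _, g in df.groupby("target")
)
print(balanced["target"].value_counts().to_dict())  # both classes now have 16 rows
```

The trade-off versus SMOTE is visible in the shapes above: under-sampling shrinks the training set (discarding majority information) while SMOTE grows it.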
log_reg_under = LogisticRegression(random_state = 1)
log_reg_under.fit(X_train_un,y_train_un )
LogisticRegression(random_state=1)
Let's again evaluate the model performance by using KFold and cross_val_score
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_under=cross_val_score(estimator=log_reg_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_under)
plt.show()
• Performance of the model on the training set varies between 0.71825 and 0.75575, which is an improvement over the initial model (without any resampling).
• Let's check the performance on the test set.
#Calculating different metrics
get_metrics_score(log_reg_under,X_train_un,X_test,y_train_un,y_test)
# creating confusion matrix
make_confusion_matrix(log_reg_under,y_test)
Accuracy on training set : 0.7326602282704127 Accuracy on test set : 0.7499177360974004 Recall on training set : 0.7348551360842844 Recall on test set : 0.694672131147541 Precision on training set : 0.7316433566433567 Precision on test set : 0.3568421052631579
• Model performance has improved using downsampling: referring to the confusion matrix, the logistic regression is now better at differentiating between the positive and negative classes.
• The recall has again increased substantially, and there is less evidence of overfitting given the small difference between the training and test recall.
• However, there is evidence of overfitting for precision. As this is not our metric of concern, and there is no strong evidence of recall overfitting, we will not apply regularization here, although capping certain coefficient values would be a possible next step.
from sklearn.pipeline import Pipeline, make_pipeline  # not imported in the setup cell above
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = []  # Empty list to store all the models
# Appending pipelines for each model into the list
models.append(
(
"LR",
Pipeline(
steps=[
("scaler", StandardScaler()),
("log_reg", LogisticRegression(random_state=1)),
]
),
)
)
models.append(
(
"RF",
Pipeline(
steps=[
("scaler", StandardScaler()),
("random_forest", RandomForestClassifier(random_state=1)),
]
),
)
)
models.append(
(
"GBM",
Pipeline(
steps=[
("scaler", StandardScaler()),
("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"ADB",
Pipeline(
steps=[
("scaler", StandardScaler()),
("adaboost", AdaBoostClassifier(random_state=1)),
]
),
)
)
models.append(
(
"XGB",
Pipeline(
steps=[
("scaler", StandardScaler()),
("xgboost", XGBClassifier(random_state=1,eval_metric='logloss')),
]
),
)
)
models.append(
(
"DTREE",
Pipeline(
steps=[
("scaler", StandardScaler()),
("decision_tree", DecisionTreeClassifier(random_state=1)),
]
),
)
)
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross-validated score
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
LR: 34.941649277378474 RF: 69.88445784063686 GBM: 74.71172424453204 ADB: 71.02790014684288 XGB: 82.17482031068862 DTREE: 73.65831980833141
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
• We can see that XGBoost gives the highest cross-validated recall, followed by gradient boosting and decision tree. For XGBoost there are outliers above and below the box.
• In an ideal scenario I would proceed with the 3 highest recall scores. However, given the computational time for gradient boosting, I have chosen to proceed with XGBoost, AdaBoost and the decision tree as the models of choice. This will be consistent for both grid search and randomized search.
• Given the good performance of these 3 models on the cross-validated recall score, we will select them for hyperparameter tuning.
We will use pipelines with StandardScaler and each model individually, tuning with GridSearchCV and RandomizedSearchCV. We will also compare the performance and time taken by these two methods.
We can also use the make_pipeline function instead of Pipeline to create a pipeline.
make_pipeline: This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
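The cost gap between the two search strategies is easy to count: GridSearchCV fits every grid combination on every fold, while RandomizedSearchCV caps the number of sampled combinations with n_iter. Counting the fits for the AdaBoost grid used below:

```python
from itertools import product

# The AdaBoost parameter grid used in this notebook.
n_estimators = list(range(10, 110, 10))     # 10 values
learning_rates = [0.1, 0.01, 0.2, 0.05, 1]  # 5 values
base_depths = [1, 2, 3]                     # 3 candidate base estimators

combos = list(product(n_estimators, learning_rates, base_depths))
cv = 5
print(len(combos))       # 150 parameter combinations
print(len(combos) * cv)  # 750 model fits for GridSearchCV
print(50 * cv)           # 250 fits for RandomizedSearchCV with n_iter=50
```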
First, let's recreate the two helper functions for the metric scores and the confusion matrix, this time reading the global train/test sets directly so that we don't have to pass them for each model.
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, flag=True):
    """
    model : classifier used to predict values of X
    """
    # Defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    score_list.extend(
        (
            train_acc,
            test_acc,
            train_recall,
            test_recall,
            train_precision,
            test_precision,
        )
    )
    # The following print statements are only displayed when flag is True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return score_list  # returning the list with train and test scores
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=[0, 1]):
    """
    model : classifier used to predict values of X
    y_actual : ground truth
    """
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot = np.asarray(annot).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"adaboostclassifier__n_estimators": np.arange(10, 110, 10),
"adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"adaboostclassifier__base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1), 'adaboostclassifier__learning_rate': 1, 'adaboostclassifier__n_estimators': 70}
Score: 0.8121068088724013
Wall time: 6min 17s
# Creating new pipeline with the chosen parameters (note: n_estimators=100 here, rather than the grid-search best of 70)
abc_tuned1 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
n_estimators=100,
learning_rate=1,
random_state=1,
),
)
# Fit the model on training data
abc_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('adaboostclassifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=1),
learning_rate=1, n_estimators=100,
random_state=1))])
# Calculating different metrics
get_metrics_score(abc_tuned1)
# Creating confusion matrix
make_confusion_matrix(abc_tuned1, y_test)
Accuracy on training set : 0.9816591422121896 Accuracy on test set : 0.9499835472194801 Recall on training set : 0.9280070237050044 Recall on test set : 0.8299180327868853 Precision on training set : 0.9565610859728507 Precision on test set : 0.8544303797468354
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(), XGBClassifier(random_state=1,eval_metric='logloss'))
#Parameter grid to pass in GridSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(50,300,50),'xgbclassifier__scale_pos_weight':[0,1,2,5,10],
'xgbclassifier__learning_rate':[0.01,0.1,0.2,0.05], 'xgbclassifier__gamma':[0,1,3,5],
'xgbclassifier__subsample':[0.7,0.8,0.9,1]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
#Fitting parameters in GridSearchCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Best parameters are {'xgbclassifier__gamma': 3, 'xgbclassifier__learning_rate': 0.05, 'xgbclassifier__n_estimators': 100, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__subsample': 0.7} with CV score=0.9385423912203417:
Wall time: 1h 12min 30s
# Creating new pipeline with the chosen parameters (note: several values differ from the grid-search best reported above)
xgb_tuned1 = make_pipeline(
StandardScaler(),
XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
subsample=0.9,
learning_rate=0.01,
gamma=5,
eval_metric='logloss',
),
)
# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('xgbclassifier',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, eval_metric='logloss',
gamma=5, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50,
n_jobs=8, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=10,
subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
get_metrics_score(xgb_tuned1)
# Creating confusion matrix
make_confusion_matrix(xgb_tuned1, y_test)
Accuracy on training set : 0.9104119638826185 Accuracy on test set : 0.8897663705166173 Recall on training set : 0.9657594381035997 Recall on test set : 0.9159836065573771 Precision on training set : 0.6485849056603774 Precision on test set : 0.6032388663967612
%%time
#Creating pipeline
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"decisiontreeclassifier__criterion": ['gini','entropy'],
"decisiontreeclassifier__max_depth": [3, 4, 5, None],
"decisiontreeclassifier__min_samples_split": [2,4,7,10,15]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
    "Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'decisiontreeclassifier__criterion': 'gini', 'decisiontreeclassifier__max_depth': None, 'decisiontreeclassifier__min_samples_split': 2}
Score: 0.7655885307983615
Wall time: 5.69 s
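A side note on the scorer: for a binary target encoded as 0/1, `metrics.make_scorer(metrics.recall_score)` is equivalent to passing `scoring="recall"` directly. A quick sanity check on synthetic data (the data here is illustrative, not the notebook's):

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=1)
clf = DecisionTreeClassifier(random_state=1)

# Both calls score the same folds with the same recall metric
scorer = metrics.make_scorer(metrics.recall_score)
a = cross_val_score(clf, X, y, scoring=scorer, cv=5)
b = cross_val_score(clf, X, y, scoring="recall", cv=5)
print(np.allclose(a, b))  # True
```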
# Creating new pipeline with best parameters
dtree_tuned1 = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(random_state=1, criterion='gini', max_depth=None, min_samples_split=4),
)
# Fit the model on training data
dtree_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('decisiontreeclassifier',
DecisionTreeClassifier(min_samples_split=4, random_state=1))])
# Calculating different metrics
get_metrics_score(dtree_tuned1)
# Creating confusion matrix
make_confusion_matrix(dtree_tuned1, y_test)
Accuracy on training set : 0.9950620767494357
Accuracy on test set : 0.9312273774267851
Recall on training set : 0.9771729587357331
Recall on test set : 0.8012295081967213
Precision on training set : 0.9919786096256684
Precision on test set : 0.7773359840954275
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "adaboostclassifier__n_estimators": np.arange(10, 110, 10),
    "adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "adaboostclassifier__base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV (named randomized_cv so it is not overwritten
# by the tuned pipeline abc_tuned2 created below)
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}:".format(randomized_cv.best_params_, randomized_cv.best_score_))
Best parameters are {'adaboostclassifier__n_estimators': 90, 'adaboostclassifier__learning_rate': 0.2, 'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.8103717443388205:
Wall time: 2min
# Creating new pipeline with best parameters
abc_tuned2 = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
        n_estimators=100,
        learning_rate=1,
        random_state=1,
    ),
)
# Fit the model on training data
abc_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('adaboostclassifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=1),
learning_rate=1, n_estimators=100,
random_state=1))])
# Calculating different metrics
get_metrics_score(abc_tuned2)
# Creating confusion matrix
make_confusion_matrix(abc_tuned2, y_test)
Accuracy on training set : 0.9816591422121896
Accuracy on test set : 0.9499835472194801
Recall on training set : 0.9280070237050044
Recall on test set : 0.8299180327868853
Precision on training set : 0.9565610859728507
Precision on test set : 0.8544303797468354
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), XGBClassifier(random_state=1, eval_metric='logloss', n_estimators=50))
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    'xgbclassifier__n_estimators': np.arange(50, 300, 50),
    'xgbclassifier__scale_pos_weight': [0, 1, 2, 5, 10],
    'xgbclassifier__learning_rate': [0.01, 0.1, 0.2, 0.05],
    'xgbclassifier__gamma': [0, 1, 3, 5],
    'xgbclassifier__subsample': [0.7, 0.8, 0.9, 1],
    'xgbclassifier__max_depth': np.arange(1, 10, 1),
    'xgbclassifier__reg_lambda': [0, 1, 2, 5, 10],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'xgbclassifier__subsample': 0.8, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__reg_lambda': 10, 'xgbclassifier__n_estimators': 100, 'xgbclassifier__max_depth': 2, 'xgbclassifier__learning_rate': 0.2, 'xgbclassifier__gamma': 5} with CV score=0.9490841641548805:
Wall time: 1min 38s
# Creating new pipeline with best parameters
xgb_tuned2 = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        (
            "XGB",
            XGBClassifier(
                random_state=1,
                n_estimators=200,
                scale_pos_weight=10,
                gamma=1,
                subsample=0.9,
                learning_rate=0.01,
                eval_metric='logloss',
                max_depth=2,
                reg_lambda=2,
            ),
        ),
    ]
)
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('XGB',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, eval_metric='logloss',
gamma=1, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=2,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=200,
n_jobs=8, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=2, scale_pos_weight=10,
subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
get_metrics_score(xgb_tuned2)
# Creating confusion matrix
make_confusion_matrix(xgb_tuned2, y_test)
Accuracy on training set : 0.7447799097065463
Accuracy on test set : 0.7351102336294834
Recall on training set : 0.9438103599648815
Recall on test set : 0.9344262295081968
Precision on training set : 0.38120567375886527
Precision on test set : 0.371033360455655
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "decisiontreeclassifier__criterion": ['gini', 'entropy'],
    "decisiontreeclassifier__max_depth": [3, 4, 5, None],
    "decisiontreeclassifier__min_samples_split": [2, 4, 7, 10, 15],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=20, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'decisiontreeclassifier__min_samples_split': 2, 'decisiontreeclassifier__max_depth': None, 'decisiontreeclassifier__criterion': 'entropy'} with CV score=0.7612218873174126:
Wall time: 2.94 s
# Creating new pipeline with best parameters
dtree_tuned2 = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(random_state=1, criterion='gini', max_depth=None, min_samples_split=7),
)
# Fit the model on training data
dtree_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('decisiontreeclassifier',
DecisionTreeClassifier(min_samples_split=7, random_state=1))])
# Calculating different metrics
get_metrics_score(dtree_tuned2)
# Creating confusion matrix
make_confusion_matrix(dtree_tuned2, y_test)
Accuracy on training set : 0.9864559819413092
Accuracy on test set : 0.9325435998683778
Recall on training set : 0.9473222124670764
Recall on test set : 0.7950819672131147
Precision on training set : 0.967713004484305
Precision on test set : 0.7870182555780934
The models of choice were AdaBoost, XGBoost and the decision tree. Each model has been fitted and its performance reviewed. Our main metric of concern is recall: we are looking for the highest recall values, which minimize the percentage of false negatives, while also checking the gap between train and test scores for signs of overfitting. Hyperparameter tuning of these models using both GridSearchCV and RandomizedSearchCV through pipelines has also been completed, which greatly improved their recall. A summary table comparing the results is provided below.
# defining list of models
models = [abc_tuned1, abc_tuned2, xgb_tuned1, xgb_tuned2, dtree_tuned1, dtree_tuned2]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
comparison_frame = pd.DataFrame(
    {
        "Model": [
            "AdaBoost tuned with GridSearchCV",
            "AdaBoost tuned with RandomizedSearchCV",
            "XGBoost tuned with GridSearchCV",
            "XGBoost tuned with RandomizedSearchCV",
            "Decision tree tuned with GridSearchCV",
            "Decision tree tuned with RandomizedSearchCV",
        ],
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
    }
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(by="Test_Recall", ascending=False)
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision |
|---|---|---|---|---|---|---|---|
| 3 | XGBoost tuned with RandomizedSearchCV | 0.744780 | 0.735110 | 0.943810 | 0.934426 | 0.381206 | 0.371033 |
| 2 | XGBoost tuned with GridSearchCV | 0.910412 | 0.889766 | 0.965759 | 0.915984 | 0.648585 | 0.603239 |
| 0 | AdaBoost tuned with GridSearchCV | 0.981659 | 0.949984 | 0.928007 | 0.829918 | 0.956561 | 0.854430 |
| 1 | AdaBoost tuned with RandomizedSearchCV | 0.981659 | 0.949984 | 0.928007 | 0.829918 | 0.956561 | 0.854430 |
| 4 | Decision tree tuned with GridSearchCV | 0.995062 | 0.931227 | 0.977173 | 0.801230 | 0.991979 | 0.777336 |
| 5 | Decision tree tuned with RandomizedSearchCV | 0.986456 | 0.932544 | 0.947322 | 0.795082 | 0.967713 | 0.787018 |
• For AdaBoost the performance is identical from grid search and randomized search. The decision tree has slightly better recall using grid search, but the difference is minimal.
• The best-performing model is XGBoost tuned through randomized search, with a test recall of 0.93 and no evidence of overfitting. Performance with grid search is still very good.
• It is also worth noting that randomized search took significantly less time than grid search. This was particularly noticeable for XGBoost, which took 1h 12min 30s with grid search and just 1min 38s with randomized search. Randomized search was also faster for both AdaBoost and the decision tree classifier. Given that grid search is computationally exhaustive, randomized search is the better method to proceed with: results tend to be as good if not better, and they are computed in significantly less time.
• Given that XGBoost was the best-performing model, we will look at the feature importances from the tuned XGBoost model. We will also look at the feature importances of the other two models, to see whether there is consistency among all three.
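The timing gap is easy to rationalize: exhaustive grid search fits every combination in the grid once per CV fold, while randomized search samples only `n_iter` of them. Counting the combinations in the XGBoost grid defined above (a sketch; the earlier GridSearchCV run used a smaller grid, so its combination count differs):

```python
import numpy as np

# The XGBoost randomized-search grid from the notebook (prefixes dropped)
param_grid = {
    'n_estimators': np.arange(50, 300, 50),
    'scale_pos_weight': [0, 1, 2, 5, 10],
    'learning_rate': [0.01, 0.1, 0.2, 0.05],
    'gamma': [0, 1, 3, 5],
    'subsample': [0.7, 0.8, 0.9, 1],
    'max_depth': np.arange(1, 10, 1),
    'reg_lambda': [0, 1, 2, 5, 10],
}

cv, n_iter = 5, 50
n_combos = int(np.prod([len(v) for v in param_grid.values()]))
print(n_combos)        # 72000 candidate settings
print(n_combos * cv)   # 360000 fits for an exhaustive grid search
print(n_iter * cv)     # 250 fits for RandomizedSearchCV with n_iter=50
```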
feature_names = X_train.columns
importances = dtree_tuned1[1].feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
feature_names = X_train.columns
importances = abc_tuned2[1].feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
feature_names = X_train.columns
importances = xgb_tuned2[1].feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
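The three plotting cells above are identical apart from the fitted pipeline. A small helper avoids the repetition (the name `plot_feature_importances` is my own, not from the notebook):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_feature_importances(pipeline, feature_names, title="Feature Importances"):
    # The fitted estimator is the second step of each pipeline (index 1)
    importances = pipeline[1].feature_importances_
    indices = np.argsort(importances)
    plt.figure(figsize=(12, 12))
    plt.title(title)
    plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel("Relative Importance")
    plt.show()

# Usage with the tuned models from above:
# for model in (dtree_tuned1, abc_tuned2, xgb_tuned2):
#     plot_feature_importances(model, X_train.columns)
```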
• Total transaction amount is the main feature across all three models.
• For the XGBoost model (the best performer) it is the dominant feature contributing to the prediction of credit card renouncers, followed closely by total revolving balance and the change in transaction amount between Q4 and Q1.
• Additional important features include months of inactivity, relationship count and months on book.
• It is worth noting that many features contribute little or nothing to distinguishing renouncers from non-renouncers in the model.
The best-performing model could be used to help identify customers who may renounce their credit card. Given that the main features in this model were transaction amount, revolving balance, the change in transaction amount between Q4 and Q1, and relationship count, this provides some useful insight for the bank.
For example, if the bank is able to identify extreme reductions in a customer's transaction amounts, it could contact the customer to inform them of any additional offers, or host events to promote certain new features. In the modern era a large portion of transactions are completed online, so the bank should ensure that it has a user-friendly online banking system. A feedback questionnaire may help here, as it would identify additional areas that need improvement.
Additionally, the bank should have a benefits system in place to encourage customers to make transactions. For example, offering a customer 1% credit on each transaction can entice them to purchase more, and agreements with restaurants, shops etc. can help promote additional transactions, e.g. offering 1.5-2% back at certain supermarkets.
Revolving balance is the credit balance carried over from month to month, and it is a strong feature for predicting renouncers of credit cards. An increasing revolving balance is generally bad for customers: it becomes expensive as interest charges mount, which may itself contribute to renouncing. If the bank can identify substantial increases in a customer's revolving balance early, it can attempt to contact the customer before interest charges escalate.
Given that relationship count is also an important feature, the bank needs to maintain its customers across their various accounts and loans. Regular communication and offers can help keep these customers satisfied and improve their longevity. We also verified in the EDA that there was a positive correlation between attrited customers and the number of contacts, i.e. the bank contacts attrited customers more than existing customers. However, if the bank can contact these customers earlier, it may prevent existing customers from becoming attrited. The bank should be more proactive in contacting current customers, both when they show signs of inactivity and in general to ensure they are satisfied; a satisfaction survey via email or text message may assist with this.